About the Provider
Qwen is an AI model family developed by Alibaba Group, a major Chinese technology and cloud computing company. Through its Qwen initiative, Alibaba builds and open-sources advanced language, image, and coding models under permissive licenses to support innovation, developer tooling, and scalable AI integration across applications.
Model Quickstart
This section helps you quickly get started with the Qwen/Qwen3-VL-30B-A3B-Instruct model on the Qubrid AI inference platform.
To use this model, you need:
- A valid Qubrid API key
- Access to the Qubrid inference API
- Basic knowledge of making API requests in your preferred language
Once these prerequisites are in place, you can send requests to the Qwen/Qwen3-VL-30B-A3B-Instruct model and receive responses based on your input prompts.
Below are example placeholders showing how the model can be accessed from different programming environments. You can choose the one that best fits your workflow.
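As a starting point, here is a minimal Python sketch that sends a plain text prompt. It assumes an OpenAI-compatible chat completions endpoint; the base URL and the `QUBRID_API_KEY` environment variable name are placeholders, so substitute the values from your Qubrid dashboard and the official API reference.

```python
# Minimal text-only request. The endpoint URL, the OpenAI-compatible
# payload shape, and the QUBRID_API_KEY variable are assumptions; use
# the values from the Qubrid documentation and your account dashboard.
import os

import requests

API_URL = "https://api.qubrid.ai/v1/chat/completions"  # placeholder URL

payload = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Give a one-sentence summary of what you can do."}
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['QUBRID_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```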
Model Overview
Qwen3 VL 30B A3B Instruct is a large-scale vision-language model designed to process and reason over both text and visual inputs. It combines strong text understanding with advanced visual perception, spatial reasoning, OCR, and video comprehension. The model supports long-context workloads and agent-style interactions, enabling it to operate GUIs, invoke tools, and complete multimodal tasks. It is provided in a Mixture-of-Experts (MoE) architecture and optimized for inference across edge-to-cloud environments. This Instruct variant is intended for instruction-following and multimodal interaction.
Model at a Glance
| Feature | Details |
|---|---|
| Model ID | Qwen/Qwen3-VL-30B-A3B-Instruct |
| Provider | Qwen |
| Architecture | Transformer decoder-only (Qwen3-VL with ViT visual encoder) |
| Model Size | 30B total parameters (Mixture-of-Experts) |
| Active Parameters | ~3B per token |
| Max Output Tokens | 32K tokens |
| Context Window | 256K tokens native, expandable to 1M |
| Image Input | Supported |
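Since the table above lists image input as supported, the following sketch shows one way a multimodal request might look, using OpenAI-style content parts with an `image_url` entry; the exact schema Qubrid expects may differ, so treat the field names as assumptions.

```python
# Hypothetical multimodal request: a text instruction plus an image URL,
# expressed as OpenAI-style content parts; Qubrid's actual schema may differ.
import os

import requests

API_URL = "https://api.qubrid.ai/v1/chat/completions"  # placeholder URL

payload = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is shown in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['QUBRID_API_KEY']}"},
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```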
When to Use?
You should consider using Qwen3 VL 30B A3B Instruct if:
- Your application requires both text and image understanding
- You need long-context processing for documents or videos
- Your workflows involve tool usage or agent-style interactions
- You need OCR and document parsing across multiple languages
- You require serverless multimodal inference
This model is suitable for inference scenarios where multimodal comprehension and extended context handling are required.
Inference Parameters
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| Streaming | boolean | true | Enable streaming responses for real-time output. |
| Temperature | number | 0.7 | Controls randomness in the output; lower values produce more deterministic text. |
| Max Tokens | number | 2048 | Maximum number of tokens to generate. |
| Top P | number | 0.9 | Nucleus sampling: restricts sampling to the smallest set of tokens whose cumulative probability exceeds this value. |
| Top K | number | 50 | Limits sampling to the top-k tokens. |
| Presence Penalty | number | 0 | Penalizes tokens that have already appeared, discouraging repetition. |
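The sketch below maps the parameters in this table onto a request body and reads a streamed response. The parameter names (`top_k`, `presence_penalty`, and so on) and the `data:` SSE framing follow common OpenAI-compatible conventions and are assumptions here; check the Qubrid API reference for the exact names.

```python
# Sketch mapping the table's parameters onto a request body and reading a
# streamed response. Parameter names and the "data:" SSE framing follow
# OpenAI-compatible conventions and are assumptions about Qubrid's API.
import json
import os

import requests

API_URL = "https://api.qubrid.ai/v1/chat/completions"  # placeholder URL

payload = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Explain nucleus sampling in two sentences."}
    ],
    "stream": True,          # Streaming
    "temperature": 0.7,      # Temperature
    "max_tokens": 2048,      # Max Tokens
    "top_p": 0.9,            # Top P
    "top_k": 50,             # Top K
    "presence_penalty": 0,   # Presence Penalty
}

with requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {os.environ['QUBRID_API_KEY']}"},
    json=payload,
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streams send "data: {...}" chunks and a final
        # "data: [DONE]" sentinel.
        if line and line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
```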
Key Features
- Visual Agent Operations : Can operate PC and mobile GUIs by recognizing interface elements, understanding their functions, invoking tools, and completing tasks (a tool-definition sketch follows this list).
- Advanced Visual and Spatial Reasoning : Judges object positions, viewpoints, and occlusions, and supports stronger 2D grounding along with 3D grounding for spatial reasoning.
- Long Context and Video Understanding : Supports a native 256K context window, expandable up to 1M, enabling processing of books and hours-long videos with second-level indexing.
- Multimodal Reasoning : Combines visual and textual reasoning for causal analysis, logical problem-solving, and evidence-based responses, including STEM and math tasks.
- Expanded OCR and Visual Recognition : Supports OCR across 32 languages and improves recognition under low light, blur, tilt, rare characters, and long-document structures.
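As a sketch of the tool-usage capability described above, the payload below declares a single hypothetical `get_weather` function using the OpenAI-style `tools` field; whether Qubrid exposes function calling in this form is an assumption to verify against the platform docs.

```python
# Hypothetical function-calling payload using the OpenAI-style "tools"
# field; the get_weather tool is illustrative only.
payload = {
    "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Hangzhou?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool name
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}
# If the model chooses to call the tool, the response's first choice
# carries a "tool_calls" entry instead of plain text content.
```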
Summary
Qwen3 VL 30B A3B Instruct is a Mixture-of-Experts vision-language model designed for multimodal inference.
- It processes text, images, and long-context inputs in a unified manner.
- The model supports visual reasoning, OCR, video understanding, and tool usage.
- It enables agent-style interactions such as GUI operation and task completion.
- Serverless inference is supported, while fine-tuning is not.